Resubmission

  • Business Understanding
    • Update explanation of what values we are measuring
  • Joint Attributes
    • Matrix Plot
      • Update Column Names for shorter and clearer names
      • Transform some of the features on log scale
    • Pairwise plots (heatmap) to help learn about our categorical variables
      • Determine what features are highly correlated to AV Installed and Enabled Variable
      • AV Build Correlation Plot
      • Relationship between Machine Form Factor and Hardware Configuration

DS 7331 Data Mining

Lab 1

  • Tahir Ahmad
  • Christopher Ballenger
  • Grant Bourzikas
  • Vitaly Briker

Introduction and Business Understanding

If we look back at the last 15-20 years, Cybersecurity has become a very important topic across every person in the world regardless of their age and every business. This point is illustrated by two growth factor- Market Capital and Jobs. The capital market has more than doubled in spend in the US since 2010 and it will triple from 2019 until 2022 reaching to a total market capital of 180 billion. Additionally, because of the growth, there is a need for cybersecurity professionals. The job market is exrtonaray hot because the number of jobs has doubled since 2010 and there will be a total of 6 million jobs in 2019 with a 2 million gap in open positions, or what the industry calls the talent shortage. This appears to reflect a healthy growth market and the cybersecurity industry is booming.

Spend in Cybersecurity Demand for Security Professionals Breaches Over Years
Spend.png Jobs.png Breaches.png
Source: Statista 2019 Source: Frost and Sullivan Source: Identity Threat Resource Center

However, the reality of cybersecurity is that since 2005, the world has seen a growth in the number of annual breaches in companies by 9x. We have seen OPM, JP Morgan, Anthem, Sony, Equifax, Home Depot, Adult Friend Finder, Yahoo!, Target and recently Marriott was just breached resulting in a compromise of 500 million records. While the world has spent hundreds of Billions of dollars and hired over 6 million cybersecurity professionals, the attackers are winning and have been winning as long as the Internet has been around. The cybersecurity attackers are very organized, well-funded, and in some cases, nation state backed. Unlike many industries, attackers or criminals are making millions of dollars attacking companies, like CyptoWall, which made a reported amount of 325 million dollars.

One answer to address the attacker problem is to integrate Data Science into cybersecurity in order to predict the outcomes of attackers utilizing rich data sets. This is a new capability that cybersecurity is starting to use and could hold the key for the future. Many industry pundits believe that data science, machine learning, deep learning, and Artificial intelligence will be the keys to successfully to stop attackers. Over the last few years, companies like McAfee, Symantec, Cylance, and Crowd Strike have developed Machine Learning Behavioral Products, but these are still not 100% effective because the attackers are writing malware that can bypass these algorithms or feature sets. In order to secure the future, the cybersecurity industry must evolve and develop more advanced algorithms to prevent and stop attackers.

In order to help solve this problem, Team 4 is going to predict the probability of whether a Windows machines could get infected by various families of malware. This dataset that will be used is part of a current Microsoft Kaggle competition and was captured by Microsoft in order to help them predict if Kagglers could determine whether machines would be breached. Within the dataset, telemetry containing various properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.

For this lab, Team 4 will predict if a machine is vulnerable to a malware attack. The response variable HasDetections will be used in order to train and test the dataset. The dataset has more than 300,000 rows and 83 columns. The team will identify key features and joint attributes that determine the classification of whether the machine is susceptible to compromise. Multiple classification algorithms will be used to train the models, for example Logistic Rregression, SVM, and Random Forest in attempt to provide the best accuracy possible. In order to measure the performance of the model, Accuracy, Sensitivity, and Log Loss function are calculated. For model training, a 3-folds Cross Validation will be performed to prevent overfitting.

In [22]:
%matplotlib inline

from IPython.display import display
import numpy as np
import pandas as pd
# import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

#Show all columns
pd.options.display.max_columns = None
pd.options.display.max_rows = 100
In [24]:
#%time train = pd.read_csv("data/train.csv")
#%time final = pd.read_csv("data/final.csv")
%time final = pd.read_csv("data/clean.final.csv")
<string>:2: DtypeWarning: Columns (28) have mixed types. Specify dtype option on import or set low_memory=False.
Wall time: 2min 39s
<string>:2: DtypeWarning: Columns (28) have mixed types. Specify dtype option on import or set low_memory=False.
Wall time: 7.09 s

Data Meaning Type

Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier. Microsoft sampled the data from their telemetry so they could enter into the Kaggle competetion.

The sampling methodology used to create this dataset was designed to meet certain business constraints, both in regards to user privacy as well as the time period during which the machine was running. Malware detection is inherently a time-series problem, but it is made complicated by the introduction of new machines, machines that come online and offline, machines that receive patches, machines that receive new operating systems, etc. While the dataset provided here has been roughly split by time, the complications and sampling requirements mentioned above may mean you may see imperfect agreement between your cross validation, public, and private scores! Additionally, this dataset is not representative of Microsoft customers’ machines in the wild; it has been sampled to include a much larger proportion of malware machines. -Microsoft Kaggle competition

For the purpose of our academic research, we elected to select a single CountryIdentifier. Through our research, we reviewed the percentage breakdown by CountryIdentifier. The largest grouping of Countries share a similar makeup of HasDetections, as a result we have decided to select CountryIdentifier 141, which includes 333k rows.

In [25]:
countries = final.pivot_table(index='CountryIdentifier', columns='HasDetections', 
                        values=['MachineIdentifier'],aggfunc=len, margins=True)

countries = countries.reindex(countries['MachineIdentifier'].sort_values(by='All', ascending=False).index)

# display(countries.MachineIdentifier.head(10))

countries_rate = countries.MachineIdentifier[[0,1]].div(countries.MachineIdentifier[[0,1]].sum(1).astype(float),
                             axis=0)

countries_rate.head(10).plot(kind='barh', 
                    stacked=True) #figsize=(4,8))
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bc4a1ca2b0>

Data Definitions

Data Field

Description of Data Field

Data Type

Data Quality

HasDetections

Ground truth and indicates that Malware was detected on the machine and will be the key attribute used to predict whether the machine has been compromised.

Response Variable (1/0)

ProductName

Defender state information e.g. win8defender

Categorical

Removed 96% of its values are the same

EngineVersion

Defender state information e.g. 1.1.12603.0

Categorical

AppVersion

Defender state information e.g. 4.9.10586.0

Categorical

AvSigVersion

Defender state information e.g. 1.217.1014.0

IsBeta

Defender state information e.g. false

Categorical

Removed 100% of its values are the same.

RtpStateBitfield

Is Real Time Transport Protocol Enabled

Categorical

Replacing any null values with -1, indicating unknown

IsSxsPassiveMode

Is Side by Side Assempblies in Passive Mode

Boolean

Removed 99% of its values are the same.

DefaultBrowsersIdentifier

ID for the machine's default browser

Categorical

Removed 93% of its values missing are the same

AVProductStatesIdentifier

ID for the specific configuration of a user's antivirus software

Categorical

Replace null values with field's mode 53447

AVProductsInstalled

Whether AV Products are installed

Categorical

Replace null values with 1, the highest mode

AVProductsEnabled

<p align=left"> Whether or not the AV is enabled

Categorical

Replaced null values with 0, indicating not enabled

HasTpm

True if machine has tpm

Boolean

Removed 99% of its values are the same.

CountryIdentifier

ID for the country the machine is located in

Categorical

Filtered by 141

CityIdentifier

ID for the city the machine is located in

Categorical

Replace null values with field's mode 92213

OrganizationIdentifier

ID for the organization the machine belongs in, organization ID is mapped to both specific companies and -broad industries

Categorical

Replace any null values with 0, indicating no organization

GeoNameIdentifier

ID for the geographic region a machine is located in

Categorical

Replace null values with the field's mode 167

LocaleEnglishNameIdentifier

English name of Locale ID of the current user

Categorical

Platform

Calculates platform name (of OS related properties and processor property)

Categorical

Processor

This is the process architecture of the installed operating system

Categorical

OsVer

Version of the current operating system

Categorical

OsBuild

Build of the current operating system

Categorical

OsSuite

Product suite mask for the current operating system.

Categorical

OsPlatformSubRelease

Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2)

Categorical

OsBuildLab

Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022

Categorical

Replace null value with the field's model that is aligned by OSBuild and Processor

SkuEdition

The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Type are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must hand each class.

Categorical

IsProtected

This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active, but is not receiving the latest updates. c. null if there are no Anti Virus Products in the report. Returns: Whether a machine is protected.

Boolean

AutoSampleOptIn

This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+

Categorical

Removed 99.99% of its values are the same.

PuaMode

Pua Enabled mode from the service

Boolean

Removed 99.99% of its values are missing

SMode

This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed

Boolean

Removed 93% of its values are the same

IeVerIdentifier

Determination of the last known state of the operating system

Categorical

Replace null values with -1, indicating unknown

SmartScreen

This is the SmartScreen enabled string value from registry. This is obtained by checking in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.

Categorical

Replaced all missings values with ExistsNotSet along with any special character values.

Firewall

This attribute is true (1) for Windows 8.1 and above if windows firewall is enabled, as reported by the service.

Boolean

UacLuaenable

This attribute reports whether or not the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.

Boolean

Removed 99% of its values are the same

Census_MDC2FormFactor

A grouping based on a combination of Device Census level hardware characteristics. The logic used to define Form Factor is rooted in business and industry standards and aligns with how people think about their device. (Examples: Smartphone, Small Tablet, All in One, Convertible...)

Categorical

Census_DeviceFamily

AKA DeviceClass. Indicates the type of device that an edition of the OS is intended for. Example values: Windows.Desktop, Windows.Mobile, and iOS.Phone

Categorical

Removed 99% of its values are the same

Census_OEMNameIdentifier

OEM Identifier COde

Categorical

Replace null values with Census_MDC2FormFactor's mode

Census_OEMModelIdentifier

OEM Model Identifier

Categorical

Replace null values with Census_MDC2FormFactor's mode

Census_ProcessorCoreCount

Number of logical cores in the processor

Continous

Replace null values with Census_MDC2FormFactor's median

Census_ProcessorManufacturerIdentifier

Processor Manufacturer

Categorical

Replace null values with Census_MDC2FormFactor's mode

Census_ProcessorModelIdentifier

Processor Model

Categorical

Replace null values with Census_MDC2FormFactor's mode

Census_ProcessorClass

A classification of processors into high/medium/low. Initially used for Pricing Level SKU. No longer maintained and updated

Categorical

Removed 99% of its values are the same

Census_PrimaryDiskTotalCapacity

Amount of disk space on primary disk of the machine in MB

Continous

Replace null values with Census_MDC2FormFactor's median

Census_PrimaryDiskTypeName

Friendly name of Primary Disk Type HDD or SSD

Categorical

Replace null values with Census_MDC2FormFactor's mode and replace Unspecified with Other

Census_SystemVolumeTotalCapacity

The size of the partition that the System volume is installed on in MB

Continous

Replace null values with Census_MDC2FormFactor's median

Census_HasOpticalDiskDrive

True indicates that the machine has an optical disk drive (CD/DVD)

Boolean

Census_TotalPhysicalRAM

Retrieves the physical RAM in MB

Continous

Replace null values with Census_MDC2FormFactor's median

Census_ChassisTypeName

Retrieves a numeric representation of what type of chassis the machine has. A value of 0 means xx

Categorical

Replace null values with Census_MDC2FormFactor's mode and Updated all unknown or numerical values to Other

Census_InternalPrimaryDiagonalDisplaySizeInInches

Retrieves the physical diagonal length in inches of the primary display

Continous

Replace null values with Census_MDC2FormFactor's median

Census_InternalPrimaryDisplayResolutionHorizontal

Retrieves the number of pixels in the horizontal direction of the internal display.

Continous

Replace null values with Census_MDC2FormFactor's median

Census_InternalPrimaryDisplayResolutionVertical

Retrieves the number of pixels in the vertical direction of the internal display

Continous

Replace null values with Census_MDC2FormFactor's median

Census_PowerPlatformRoleName

Indicates the OEM preferred power management profile. This value helps identify the basic form factor of the device

Categorical

Replace any null and Unknown values with Other

Census_InternalBatteryType

Tye of Battert

Categorical

Removed 70% missing data and unable to classify values by Census_MDC2FormFactor

Census_InternalBatteryNumberOfCharges

Number of Charges on Battery

Continous

Replace null values with Census_MDC2FormFactor's median

Census_OSVersion

Numeric OS version Example 10.0.10130.0

Categorical

Census_OSArchitecture

Architecture on which the OS is based. Derived from OSVersionFull. Example amd64

Categorical

Census_OSBranch

Branch of the OS extracted from the OsVersionFull. Example OsBranch = fbl_partner_eeap where OsVersion = 6.4.9813.0.amd64fre.fbl_partner_eeap.140810-0005

Categorical

Census_OSBuildNumber

OS Build number extracted from the OsVersionFull. Example OsBuildNumber = 10512 or 10240

Categorical

Census_OSBuildRevision

OS Build revision extracted from the OsVersionFull. Example OsBuildRevision = 1000 or 16458

Categorical

Census_OSEdition

Edition of the current OS. Sourced from HKLM\Software\Microsoft\Windows NT\CurrentVersion@EditionID in registry. Example: Enterprise

Categorical

Census_OSSkuName

OS edition friendly name (currently Windows only)

Categorical

Census_OSInstallTypeName

Friendly description of what install was used on the machine i.e. clean

Categorical

Census_OSInstallLanguageIdentifier

Language of Installer

Categorical

Replace null values with Unknown, unable to classify mode

Census_OSUILocaleIdentifier

Locale of Installer

Categorical

Census_OSWUAutoUpdateOptionsName

Friendly name of the WindowsUpdate auto-update settings on the machine.

Categorical

Census_IsPortableOperatingSystem

Indicates whether OS is booted up and running via Windows-To-Go on a USB stick.

Boolean

Removed 99% of its values are the same.

Census_GenuineStateName

Friendly name of OSGenuineStateID. 0 = Genuine

Boolean

Census_ActivationChannel

Retail license key or Volume license key for a machine.

Categorical

Census_IsFlightingInternal

Is Driver Flighting Enabled

Boolean

Updated null values to 0

Census_IsFlightsDisabled

Indicates if the machine is participating in flighting.

Boolean

Removed 98% of its values are the same \ missing

Census_FlightRing

The ring that the device user would like to receive flights for. This might be different from the ring of the OS which is currently installed if the user changes the ring after getting a flight from a different ring.

Categorical

Census_ThresholdOptIn

Driver Device Threshold Enabled

Boolean

Removed 99.99% of its values are the same

Census_FirmwareManufacturerIdentifier

Manufcator of Firemware

Categorical

Replace null values with Census_MDC2FormFactor's mode

Census_FirmwareVersionIdentifier

Version of Firemware

Categorical

Replace null values with Census_MDC2FormFactor's mode

Census_IsSecureBootEnabled

Indicates if Secure Boot mode is enabled.

Boolean

Census_IsWIMBootEnabled

Is image file boot enabled

Boolean

Removed 100% of its values are the same

Census_IsVirtualDevice

Identifies a Virtual Machine (machine learning model)

Boolean

Removed 99% of its values are the same.

Census_IsTouchEnabled

Is this a touch device ?

Boolean

Census_IsPenCapable

Is the device capable of pen input ?

Boolean

Census_IsAlwaysOnAlwaysConnectedCapable

Retreives information about whether the battery enables the device to be AlwaysOnAlwaysConnected.

Boolean

Wdft_IsGamer

Indicates whether the device is a gamer device or not based on its hardware combination.

Boolean

Wdft_RegionIdentifier

Region Code

Categorical

Replaced any null values with its mode 10

Data Quality

In order to understand the quality of the data, we did an extensive review of the data and determined which fields should be considered categorical, continous, or boolean. We converted 23 id and category fields to object as well as removed outliers to make values boolean.

In order to help provide easier development, we created 4 array of column names: cols_booleans, cols_numerical, cols_categorical, and cols_categorical_large. The values with over 100 possible values were moved to its own bucket, cols_categorical_large, requiring special care to review in order to determine if we could cluster values together.

The remaining code are steps we took to clean our data.

In [26]:
# Load Column groups to help with data wrangling
%run -i ColumnArrays.py
<Figure size 432x288 with 0 Axes>
In [27]:
print( "cols_booleans" ) 
display( cols_booleans )
print( "cols_numerical" )
display( cols_numerical )
print( "cols_categorical" )
display( cols_categorical )
print( "cols_categorical_large" )
display( cols_categorical_large )

#Convert features to right data type
final[cols_categorical] = final[cols_categorical].astype(object)
final[cols_categorical_large] = final[cols_categorical_large].astype(object)
cols_booleans
['IsProtected',
 'Firewall',
 'Census_HasOpticalDiskDrive',
 'Census_IsFlightingInternal',
 'Census_IsSecureBootEnabled',
 'Census_IsTouchEnabled',
 'Census_IsPenCapable',
 'Census_IsAlwaysOnAlwaysConnectedCapable',
 'Wdft_IsGamer']
cols_numerical
['Census_ProcessorCoreCount',
 'Census_PrimaryDiskTotalCapacity',
 'Census_SystemVolumeTotalCapacity',
 'Census_TotalPhysicalRAM',
 'Census_InternalPrimaryDiagonalDisplaySizeInInches',
 'Census_InternalPrimaryDisplayResolutionHorizontal',
 'Census_InternalPrimaryDisplayResolutionVertical']
cols_categorical
['EngineVersion',
 'RtpStateBitfield',
 'AVProductsInstalled',
 'AVProductsEnabled',
 'OrganizationIdentifier',
 'Platform',
 'Processor',
 'OsVer',
 'OsBuild',
 'OsSuite',
 'OsPlatformSubRelease',
 'SkuEdition',
 'SmartScreen',
 'Census_MDC2FormFactor',
 'Census_ProcessorManufacturerIdentifier',
 'Census_PrimaryDiskTypeName',
 'Census_ChassisTypeName',
 'Census_PowerPlatformRoleName',
 'Census_OSArchitecture',
 'Census_OSBranch',
 'Census_OSBuildNumber',
 'Census_OSEdition',
 'Census_OSSkuName',
 'Census_OSInstallTypeName',
 'Census_OSInstallLanguageIdentifier',
 'Census_OSUILocaleIdentifier',
 'Census_OSWUAutoUpdateOptionsName',
 'Census_GenuineStateName',
 'Census_ActivationChannel',
 'Census_FlightRing',
 'Wdft_RegionIdentifier']
cols_categorical_large
['AppVersion',
 'AvSigVersion',
 'AVProductStatesIdentifier',
 'CityIdentifier',
 'GeoNameIdentifier',
 'OsBuildLab',
 'IeVerIdentifier',
 'Census_OEMNameIdentifier',
 'Census_OEMModelIdentifier',
 'Census_ProcessorModelIdentifier',
 'Census_OSVersion',
 'Census_OSBuildRevision',
 'Census_FirmwareManufacturerIdentifier',
 'Census_FirmwareVersionIdentifier',
 'LocaleEnglishNameIdentifier']

1) Understand scope of missing data \ fill rate on values

The following section outlines the percentage of missing data and fields with frequency of same values. A benefit of reviewing the percentage of features with the same value is to see which fields we can easily provide median or mode values and others that may be hard to provide a good value for.

The top 3 fields, PUAMode, Census_ProcessorClass, and DefaultBrowsersIdentifier had over 93% of values missing. These values were exluded from our report due to lack of data in order to predict our hypotheses.

In [28]:
#create new summary table
temp_data = []

for col in final.columns:
    temp_data.append((col, final[col].isnull().sum() * 100 / final.shape[0],
                final[col].value_counts(normalize=True, dropna=False).values[0] * 100,
                final[col].dtype))

temp_data = pd.DataFrame(temp_data, columns=['Attribute','Missing_values [%]',
                                 'Higher_frequency_data [%]','Data_type'])

#sort by Freqency values
display( temp_data.loc[temp_data["Missing_values [%]"] > 0,:]
        .sort_values(by =[temp_data.columns[1]], ascending=False) )
Attribute Missing_values [%] Higher_frequency_data [%] Data_type
28 PuaMode 99.992802 99.992802 object
41 Census_ProcessorClass 99.519512 99.519512 object
8 DefaultBrowsersIdentifier 93.517610 93.517610 float64
68 Census_IsFlightingInternal 83.188317 83.188317 float64
52 Census_InternalBatteryType 70.148855 70.148855 object
71 Census_ThresholdOptIn 62.470644 62.470644 float64
75 Census_IsWIMBootEnabled 62.352772 62.352772 float64
31 SmartScreen 30.604569 44.501231 object
29 SMode 7.800283 92.132833 float64
15 OrganizationIdentifier 4.041558 79.914580 object
53 Census_InternalBatteryNumberOfCharges 3.557771 60.295551 float64
81 Wdft_RegionIdentifier 3.280336 96.186988 object
80 Wdft_IsGamer 3.280336 76.631545 float64
69 Census_IsFlightsDisabled 1.959443 98.040557 float64
14 CityIdentifier 1.903057 17.186895 object
72 Census_FirmwareManufacturerIdentifier 1.604026 22.558944 object
73 Census_FirmwareVersionIdentifier 1.442964 1.841871 object
32 Firewall 0.994568 96.996800 float64
46 Census_TotalPhysicalRAM 0.930383 46.469673 float64
79 Census_IsAlwaysOnAlwaysConnectedCapable 0.904289 95.713999 float64
37 Census_OEMModelIdentifier 0.824208 6.065187 object
36 Census_OEMNameIdentifier 0.766921 20.298370 object
62 Census_OSInstallLanguageIdentifier 0.630153 55.428285 object
44 Census_SystemVolumeTotalCapacity 0.614257 1.029060 float64
42 Census_PrimaryDiskTotalCapacity 0.614257 36.263351 float64
48 Census_InternalPrimaryDiagonalDisplaySizeInInches 0.554571 25.478464 float64
49 Census_InternalPrimaryDisplayResolutionHorizontal 0.553371 66.608180 float64
50 Census_InternalPrimaryDisplayResolutionVertical 0.553371 71.578022 float64
30 IeVerIdentifier 0.540174 30.587773 object
11 AVProductsEnabled 0.475389 96.653680 object
9 AVProductStatesIdentifier 0.475389 59.529230 object
10 AVProductsInstalled 0.475389 63.650269 object
26 IsProtected 0.473290 93.694869 float64
40 Census_ProcessorModelIdentifier 0.469391 6.227449 object
39 Census_ProcessorManufacturerIdentifier 0.469391 75.951903 object
38 Census_ProcessorCoreCount 0.469391 47.090528 float64
76 Census_IsVirtualDevice 0.447796 99.135002 float64
6 RtpStateBitfield 0.419602 96.998899 object
43 Census_PrimaryDiskTypeName 0.164062 79.031886 object
33 UacLuaenable 0.150265 99.631686 float64
47 Census_ChassisTypeName 0.011097 62.918440 object
16 GeoNameIdentifier 0.001800 88.886389 object
51 Census_PowerPlatformRoleName 0.000900 72.996692 object
24 OsBuildLab 0.000300 28.018272 object
In [29]:
#sort by Freqency values
display( temp_data.loc[temp_data["Higher_frequency_data [%]"] > 80,:]
        .sort_values(by =[temp_data.columns[2]], ascending=False) )
Attribute Missing_values [%] Higher_frequency_data [%] Data_type
5 IsBeta 0.000000 100.000000 int64
13 CountryIdentifier 0.000000 100.000000 int64
27 AutoSampleOptIn 0.000000 99.995201 int64
28 PuaMode 99.992802 99.992802 object
65 Census_IsPortableOperatingSystem 0.000000 99.957410 int64
35 Census_DeviceFamily 0.000000 99.891125 object
33 UacLuaenable 0.150265 99.631686 float64
41 Census_ProcessorClass 99.519512 99.519512 object
76 Census_IsVirtualDevice 0.447796 99.135002 float64
1 ProductName 0.000000 98.952944 object
12 HasTpm 0.000000 98.795181 int64
7 IsSxsPassiveMode 0.000000 98.637717 int64
69 Census_IsFlightsDisabled 1.959443 98.040557 float64
78 Census_IsPenCapable 0.000000 97.568167 int64
6 RtpStateBitfield 0.419602 96.998899 object
32 Firewall 0.994568 96.996800 float64
11 AVProductsEnabled 0.475389 96.653680 object
81 Wdft_RegionIdentifier 3.280336 96.186988 object
79 Census_IsAlwaysOnAlwaysConnectedCapable 0.904289 95.713999 float64
20 OsVer 0.000000 95.713399 object
18 Platform 0.000000 95.612622 object
45 Census_HasOpticalDiskDrive 0.000000 94.818107 int64
70 Census_FlightRing 0.000000 93.997199 object
26 IsProtected 0.473290 93.694869 float64
8 DefaultBrowsersIdentifier 93.517610 93.517610 float64
29 SMode 7.800283 92.132833 float64
66 Census_GenuineStateName 0.000000 91.314924 object
16 GeoNameIdentifier 0.001800 88.886389 object
55 Census_OSArchitecture 0.000000 88.734925 object
19 Processor 0.000000 88.723227 object
17 LocaleEnglishNameIdentifier 0.000000 86.775181 object
77 Census_IsTouchEnabled 0.000000 86.477951 int64
68 Census_IsFlightingInternal 83.188317 83.188317 float64

2) Data Imputation

The dataset was analyzed and three key classifications of data were utilized:

  • Categorical Variables: The process followed for categorical was to identify its mode for each feature to replace with. However, to try and prevent as much bias in our data, we would factor in another feature to help classify our features. For example, OsBuildLab missing values were derived from OSBuild and Processor.

  • Continous Variables: All the continous variables' values were derived from its median by grouping Census_MDC2FormFactor. We were able to classify how much RAM a machine has based on if it was a Tablet or a Server.

  • Boolean Variables: Any values missing or we felt was an outlier was imputed witht he value of 0.

A description of all the values updated or removed can be found in the section "Data Meaning Type".
In [30]:
# Smart Screen fill miising values and fix characters issue

final.SmartScreen.fillna('ExistsNotSet', inplace=True)
final.SmartScreen.replace({"off":"Off","00000000":"ExistsNotSet","&#x02;" :"ExistsNotSet",
                                 "&#x01;" :"ExistsNotSet"},inplace=True)

# currently renamed "Census_PrimaryDiskTypeName" unknown data into one category
final.Census_PrimaryDiskTypeName.replace({"Unspecified":"Other"},inplace=True)

# currently renamed "Census_ChassisTypeName" unknown data into one category

final.Census_ChassisTypeName.replace({"UNKNOWN":"Other","Unknown":"Other","0" :"Other",
                                "30" :"Other",
                                "35" :"Other",
                                "112" :"Other",
                                "76" :"Other",
                                "39" :"Other"},inplace=True)

# currently renamed "Census_PowerPlatformRoleName" unknown data into one category

final.Census_PowerPlatformRoleName.fillna('Other', inplace=True)

final.Census_PowerPlatformRoleName.replace({"UNKNOWN":"Other"},inplace=True)

#IsFlightsDisabled and IsFlightingInternal  
final.Census_IsFlightingInternal.fillna("1", inplace=True)
final.Census_IsFlightsDisabled.fillna("1", inplace=True)

#Wdft_RegionIdentifier
final.Wdft_RegionIdentifier.astype(float,inplace=True)
final.Wdft_RegionIdentifier.fillna("10",inplace=True)
final.Wdft_RegionIdentifier.astype(object,inplace=True)

#OrganizationIdentifier
final.OrganizationIdentifier.fillna("0",inplace=True)

#RtpStateBitfield
final.RtpStateBitfield.fillna("-1",inplace=True)

#CityIdentifier
final.CityIdentifier.astype(float,inplace=True)
final.CityIdentifier.fillna("92213",inplace=True)
final.CityIdentifier.astype(object,inplace=True)

#Census_OSInstallLanguageIdentifier unable to classify the value
final.Census_OSInstallLanguageIdentifier.fillna("Unknown",inplace=True)

#AVProductsInstalled 1 was the highest mode
final.AVProductsInstalled.astype(float, inplace=True)
final.AVProductsInstalled.fillna(1,inplace=True)
final.AVProductsInstalled.astype(object, inplace=True)

#AVProductsEnabled default to 0 if no value
final.AVProductsEnabled.astype(float,inplace=True)
final.AVProductsEnabled.fillna(0,inplace=True)

#AVProductStatesIdentifier default to 53447 highest mode
final.AVProductStatesIdentifier.astype(float, inplace=True)
final.AVProductStatesIdentifier.fillna(53447,inplace=True)
final.AVProductStatesIdentifier.astype(object, inplace=True)

#GeoNameIdentifier167.0
final.GeoNameIdentifier.fillna("167",inplace=True)

#OsBuildLab match by OSBuild and Processor
final.OsBuildLab.fillna("17134.1.amd64fre.rs4_release.180410-1804",inplace=True)

#IeVerIdentifier default -1, unable to classify
final.IeVerIdentifier.fillna("-1",inplace=True)

#Defautl all boolean nulls to 0
final.IsProtected.fillna(0,inplace=True)
final.Firewall.fillna(0,inplace=True)
final.UacLuaenable.fillna(0,inplace=True)
final.Census_ThresholdOptIn.fillna(0,inplace=True)
final.Census_IsWIMBootEnabled.fillna(0,inplace=True)
final.Census_IsVirtualDevice.fillna(0,inplace=True)
final.Census_IsAlwaysOnAlwaysConnectedCapable.fillna(0,inplace=True)
final.Wdft_IsGamer.fillna(0,inplace=True)
In [31]:
cols = np.append(["Census_MDC2FormFactor"], cols_numerical)

df_grouped = final.groupby("Census_MDC2FormFactor").agg({
    "Census_ProcessorCoreCount":"median",
    "Census_PrimaryDiskTotalCapacity":"median",
    "Census_SystemVolumeTotalCapacity":"median",
    "Census_TotalPhysicalRAM":"median",
    "Census_InternalPrimaryDiagonalDisplaySizeInInches":"median",
    "Census_InternalPrimaryDisplayResolutionHorizontal":"median",
    "Census_InternalPrimaryDisplayResolutionVertical":"median",
    "Census_InternalBatteryNumberOfCharges":"median"
})

# Find the mode instead of median
cols = [
    "Census_FirmwareManufacturerIdentifier",
    "Census_FirmwareVersionIdentifier",
    "Census_OEMNameIdentifier",
    "Census_OEMModelIdentifier",
#     "Census_ProcessorModelIdentifier",
    "Census_ProcessorManufacturerIdentifier",
    "Census_PrimaryDiskTypeName",
    "Census_ChassisTypeName"
]
for col in cols:
    imputes = final.groupby(["Census_MDC2FormFactor",col]).agg({
    "MachineIdentifier":"count"})

    imputes.reset_index(level=[col], inplace=True)
    # Max count to identify the right firmware
    idx = imputes.groupby(["Census_MDC2FormFactor"])['MachineIdentifier'].transform(max) == imputes['MachineIdentifier']
    imputes.drop(["MachineIdentifier"], axis=1, inplace=True)
    # Merge df_grouped
    df_grouped = pd.merge(df_grouped,imputes[idx], left_index = True, right_index = True, how="inner")


# Census_ProcessorModelIdentifier
# Accounting for categories with same counts.  Picking the first one
imputes = final.groupby(["Census_MDC2FormFactor","Census_ProcessorModelIdentifier"]).agg({
"MachineIdentifier":"count"})

imputes.reset_index(level=["Census_MDC2FormFactor","Census_ProcessorModelIdentifier"], inplace=True)
# Max count to identify the right firmware
idx = imputes.groupby(["Census_MDC2FormFactor"])['MachineIdentifier'].transform(max) == imputes['MachineIdentifier']
# Merge df_grouped

imputes = imputes[idx]

imputes['row_id'] = imputes.groupby(['Census_MDC2FormFactor']).cumcount()+1

imputes = imputes.loc[imputes["row_id"]==1,]
imputes.set_index('Census_MDC2FormFactor',inplace=True)

imputes.drop(["MachineIdentifier","row_id"], axis=1, inplace=True)

df_grouped = pd.merge(df_grouped,imputes, left_index = True, right_index = True, how="inner")
In [32]:
for name, group in df_grouped.items():
    for key, value in group.items():
        mask = final["Census_MDC2FormFactor"]==key
#         print( name + "." + key + " Value: " + str(value) )
#         print( "Updating: " + str(final.loc[final[name].isnull() & mask, name].shape) )
        final.loc[final[name].isnull() & mask, name] = value
        final[name].replace(-1,value,inplace=True)   
        
In [33]:
# Save Clean File
# final.to_csv("data/clean.final.csv")

Simple Statistics

The features below is represents the hardware configurations on a machine. Reviewing each data set, it appears majority of the machines in our data are personal desktops. They are a typical desktop with 4 cores, 4 gigabytes of memory, and standard resolution on the monitor. The majority are desktops since majority has no battery charges.

This data gives us a better understanding what our population of users could be. One of the hardest aspect of this dataset is not knowing what type of user would be suspectable to Malware. It appears these machines are individuals vs. server or custom corproate builds.

In [34]:
final[cols_numerical].describe()
Out[34]:
Census_ProcessorCoreCount Census_PrimaryDiskTotalCapacity Census_SystemVolumeTotalCapacity Census_TotalPhysicalRAM Census_InternalPrimaryDiagonalDisplaySizeInInches Census_InternalPrimaryDisplayResolutionHorizontal Census_InternalPrimaryDisplayResolutionVertical
count 333411.000000 3.334110e+05 3.334110e+05 333411.000000 333411.000000 333411.000000 333411.000000
mean 3.195485 5.688072e+05 5.332372e+05 5268.058777 15.860255 1436.107384 827.462102
std 1.568252 3.476213e+05 3.405699e+05 3858.531097 4.964178 250.134264 144.373430
min 1.000000 1.022700e+04 9.676000e+03 512.000000 5.500000 320.000000 240.000000
25% 2.000000 3.052450e+05 2.905430e+05 4096.000000 13.900000 1366.000000 768.000000
50% 4.000000 4.769400e+05 4.636820e+05 4096.000000 15.500000 1366.000000 768.000000
75% 4.000000 9.538690e+05 9.269920e+05 8192.000000 15.600000 1366.000000 800.000000
max 64.000000 1.716940e+07 1.716879e+07 786432.000000 142.000000 6016.000000 3840.000000

There is also some evidence that machines with higher performance may be more vulnerable to malware.

In [35]:
cols = np.append(["HasDetections"],cols_numerical)
final[cols].groupby(by='HasDetections').mean()
Out[35]:
Census_ProcessorCoreCount Census_PrimaryDiskTotalCapacity Census_SystemVolumeTotalCapacity Census_TotalPhysicalRAM Census_InternalPrimaryDiagonalDisplaySizeInInches Census_InternalPrimaryDisplayResolutionHorizontal Census_InternalPrimaryDisplayResolutionVertical
HasDetections
0 3.169518 546822.643511 512973.714577 5081.171897 15.734224 1431.957352 829.135312
1 3.219487 589127.574981 551966.928248 5440.799409 15.976746 1439.943283 825.915545

To better understand its higher mean, we perform visualization for different hardware configurations between machines having a detection or not. The box plots were view on the log scale, with the exception of Census_ProcessorCoreCount.

Visually, only Census_PrimaryDiskTotalCapacity shows a difference of its mean and median between machines being attacked.

In [36]:
#Most of our data has processor count <10
final.boxplot(column='Census_ProcessorCoreCount', by = 'HasDetections')
plt.suptitle("")
Out[36]:
Text(0.5,0.98,'')
In [37]:
ax = final.boxplot(
    column='Census_PrimaryDiskTotalCapacity',
    by = 'HasDetections')

ax.set_yscale('log')
plt.suptitle("")
Out[37]:
Text(0.5,0.98,'')
In [38]:
ax = final.boxplot(
    column='Census_SystemVolumeTotalCapacity',
    by = 'HasDetections')
ax.set_yscale('log')
plt.suptitle("")
Out[38]:
Text(0.5,0.98,'')
In [39]:
ax = final.boxplot(
    column='Census_TotalPhysicalRAM', 
    by = 'HasDetections')
ax.set_yscale('log')
plt.suptitle("")
Out[39]:
Text(0.5,0.98,'')

Below are the mode statistics for our categorical variables. From our review of the different categories, we found that the values are heavily weighted to 1 or 2 values. We believe this is because there are many different scenarios that make a machine vulnerable. We hope through unsupervised clustering on categorical features, we can learn more about how the data explains the store of detections.

In [40]:
for col in cols_categorical:
    display(final.groupby(col).agg({'MachineIdentifier':"count"}) )
MachineIdentifier
EngineVersion
1.1.11701.0 6
1.1.11903.0 1
1.1.12101.0 4
1.1.12400.0 5
1.1.12603.0 2
1.1.12706.0 1
1.1.12805.0 132
1.1.12902.0 251
1.1.13000.0 220
1.1.13103.0 245
1.1.13202.0 230
1.1.13303.0 421
1.1.13406.0 1
1.1.13407.0 434
1.1.13504.0 4579
1.1.13601.0 353
1.1.13701.0 236
1.1.13704.0 244
1.1.13804.0 470
1.1.13902.0 5
1.1.13903.0 484
1.1.14001.0 1
1.1.14002.0 7
1.1.14003.0 700
1.1.14102.0 1
1.1.14103.0 2
1.1.14104.0 4301
1.1.14201.0 10
1.1.14202.0 809
1.1.14303.0 19
1.1.14305.0 254
1.1.14306.0 1271
1.1.14405.2 1637
1.1.14500.2 15
1.1.14500.5 2184
1.1.14600.4 6396
1.1.14700.3 43
1.1.14700.4 36
1.1.14700.5 2036
1.1.14800.1 39
1.1.14800.3 5738
1.1.14901.3 77
1.1.14901.4 9474
1.1.15000.1 109
1.1.15000.2 11263
1.1.15100.1 132819
1.1.15200.1 139049
1.1.15300.5 2374
1.1.15300.6 4423
MachineIdentifier
RtpStateBitfield
0.0 6275
1.0 30
3.0 133
5.0 766
7.0 323405
8.0 1403
-1 1399
MachineIdentifier
AVProductsInstalled
1.0 213802
2.0 111728
3.0 7647
4.0 224
5.0 10
MachineIdentifier
AVProductsEnabled
0.0 2836
1.0 322254
2.0 8238
3.0 83
MachineIdentifier
OrganizationIdentifier
3.0 3
4.0 11
10.0 1
11.0 50
14.0 1
18.0 51041
20.0 3
21.0 2
27.0 266444
28.0 4
32.0 2
33.0 11
36.0 134
37.0 40
40.0 1
46.0 1
47.0 89
48.0 2088
49.0 10
0 13475
MachineIdentifier
Platform
windows10 318783
windows2016 362
windows7 3430
windows8 10836
MachineIdentifier
Processor
arm64 1
x64 295813
x86 37597
MachineIdentifier
OsVer
10.0.0.0 319119
10.0.1.0 13
10.0.3.0 13
6.1.0.0 35
6.1.1.0 3393
6.1.3.0 2
6.3.0.0 10828
6.3.0.16 1
6.3.1.0 3
6.3.3.0 4
MachineIdentifier
OsBuild
7600 34
7601 3396
9600 10836
10240 10622
10586 18006
14393 37805
15063 45207
16299 104693
17134 102566
17634 1
17655 1
17661 4
17666 1
17672 2
17677 3
17682 2
17686 6
17692 45
17713 9
17730 1
17733 7
17735 14
17738 32
17741 6
17744 34
17746 12
17751 13
17754 12
17755 6
17758 24
17760 4
17763 4
18234 1
18237 2
MachineIdentifier
OsSuite
16 28
49 1
256 101087
272 276
274 1
305 39
400 17
768 231953
784 9
MachineIdentifier
OsPlatformSubRelease
prers5 246
rs1 37805
rs2 45207
rs3 104693
rs4 102566
th1 10622
th2 18006
windows7 3430
windows8.1 10836
MachineIdentifier
SkuEdition
Cloud 572
Education 391
Enterprise 604
Enterprise LTSB 547
Home 230442
Invalid 2586
Pro 98179
Server 90
MachineIdentifier
SmartScreen
Block 533
ExistsNotSet 175654
Off 5248
Prompt 643
RequireAdmin 148372
Warn 2961
MachineIdentifier
Census_MDC2FormFactor
AllInOne 31388
Convertible 18177
Desktop 46802
Detachable 7402
LargeServer 12
LargeTablet 2563
MediumServer 44
Notebook 222938
PCOther 2920
SmallServer 320
SmallTablet 845
MachineIdentifier
Census_ProcessorManufacturerIdentifier
1.0 78557
3.0 56
5.0 254797
10.0 1
MachineIdentifier
Census_PrimaryDiskTypeName
HDD 264047
Other 12832
SSD 43158
UNKNOWN 13374
MachineIdentifier
Census_ChassisTypeName
AllinOne 23880
BusExpansionChassis 24
Convertible 2721
Desktop 47501
Detachable 1442
HandHeld 1622
Laptop 17690
LowProfileDesktop 3612
LunchBox 179
MainServerChassis 604
MiniPC 339
MiniTower 1969
MultisystemChassis 1
Notebook 209798
Other 5902
Portable 13572
RackMountChassis 86
SealedCasePC 1
SpaceSaving 1617
StickPC 1
SubNotebook 51
Tablet 347
Tower 452
MachineIdentifier
Census_PowerPlatformRoleName
AppliancePC 152
Desktop 72136
EnterpriseServer 257
Mobile 243379
Other 833
SOHOServer 1883
Slate 11916
Workstation 2855
MachineIdentifier
Census_OSArchitecture
amd64 295852
arm64 1
x86 37558
MachineIdentifier
Census_OSBranch
rs1_release 40801
rs2_release 46144
rs3_release 46984
rs3_release_svc_escrow 56842
rs3_release_svc_escrow_im 186
rs4_release 105310
rs5_release 177
rs_prerelease 32
rs_prerelease_flt 41
th1 2755
th1_st1 7897
th2_release 13708
th2_release_sec 12534
MachineIdentifier
Census_OSBuildNumber
10240 10652
10586 26242
14295 1
14316 1
14379 1
14393 40798
14915 2
14986 2
15063 46144
16193 1
16281 1
16291 1
16299 104010
17083 1
17115 1
17133 3
17134 105306
17634 1
17655 1
17661 4
17666 1
17672 1
17677 3
17682 2
17686 6
17692 45
17713 3
17733 7
17735 13
17738 35
17741 6
17744 32
17746 11
17751 12
17754 11
17755 6
17758 28
17760 3
17763 10
18234 1
18237 1
18242 1
MachineIdentifier
Census_OSEdition
Cloud 625
CloudN 1
Core 66706
CoreCountrySpecific 9
CoreN 58
CoreSingleLanguage 167513
Education 390
EducationN 10
Enterprise 619
EnterpriseN 3
EnterpriseS 547
EnterpriseSN 6
Professional 93541
ProfessionalEducation 2388
ProfessionalEducationN 6
ProfessionalN 624
ProfessionalWorkstation 2
ProfessionalWorkstationN 1
ServerDatacenter 1
ServerDatacenterEval 17
ServerSolution 40
ServerStandard 231
ServerStandardEval 73
MachineIdentifier
Census_OSSkuName
CLOUD 624
CLOUDN 1
CORE 66699
CORE_COUNTRYSPECIFIC 9
CORE_N 58
CORE_SINGLELANGUAGE 167498
DATACENTER_EVALUATION_SERVER 17
DATACENTER_SERVER 1
EDUCATION 393
EDUCATION_N 10
ENTERPRISE 619
ENTERPRISE_N 3
ENTERPRISE_S 547
ENTERPRISE_S_N 6
PROFESSIONAL 95948
PROFESSIONAL_N 629
PRO_WORKSTATION 2
PRO_WORKSTATION_N 1
SB_SOLUTION_SERVER 40
STANDARD_EVALUATION_SERVER 72
STANDARD_SERVER 233
UNLICENSED 1
MachineIdentifier
Census_OSInstallTypeName
Clean 2307
CleanPCRefresh 1616
IBSClean 51905
Other 36399
Refresh 11922
Reset 37915
UUPUpgrade 62656
Update 75099
Upgrade 53592
MachineIdentifier
Census_OSInstallLanguageIdentifier
3.0 3
4.0 1
5.0 70
7.0 273
8.0 32260
9.0 184804
10.0 113440
13.0 7
14.0 87
15.0 8
17.0 2
18.0 25
19.0 75
20.0 64
24.0 8
25.0 12
26.0 60
27.0 5
28.0 2
29.0 34
30.0 1
33.0 3
35.0 10
37.0 47
39.0 9
Unknown 2101
MachineIdentifier
Census_OSUILocaleIdentifier
20 3
26 66
30 271
31 31312
34 186476
35 114765
36 2
37 58
48 5
49 87
54 1
58 8
64 2
72 24
74 75
83 67
109 6
115 13
119 60
120 4
123 2
125 34
128 1
140 3
148 13
158 44
160 9
MachineIdentifier
Census_OSWUAutoUpdateOptionsName
AutoInstallAndRebootAtMaintenanceTime 17561
DownloadNotify 362
FullAuto 115889
Notify 86081
Off 615
UNKNOWN 112903
MachineIdentifier
Census_GenuineStateName
INVALID_LICENSE 19782
IS_GENUINE 304454
OFFLINE 8710
UNKNOWN 465
MachineIdentifier
Census_ActivationChannel
OEM:DM 144425
OEM:NONSLP 5161
Retail 176191
Retail:TB:Eval 90
Volume:GVLK 7492
Volume:MAK 52
MachineIdentifier
Census_FlightRing
Disabled 84
NOT_SET 10761
RP 276
Retail 313397
Unknown 8596
WIF 116
WIS 181
MachineIdentifier
Wdft_RegionIdentifier
1.0 19
2.0 4
3.0 24
4.0 89
5.0 9
6.0 12
7.0 15
8.0 3
9.0 11
10.0 320698
11.0 29
12.0 19
13.0 1419
14.0 74
15.0 49
10 10937

Visualize Attributes

AvSigVersion

In the analysis, AvSigVersion shows the majority of machines using the latest 2 of the 3 builds (1.275 & 1.273). Domain knowledge would tell us that new definition set would have the latest detections for malware, but data tells us otherwise.

In [41]:
# Extrac 1.XX Release values
AvSigVersion_split = final["AvSigVersion"].str.rsplit(pat=".",expand=True)
final["AvSigVersion_1_x"] = AvSigVersion_split.loc[:,0]+"."+AvSigVersion_split.loc[:,1]
final["AvSigVersion_1_x_xx"] = AvSigVersion_split.loc[:,0]+"."+AvSigVersion_split.loc[:,1]+"."+AvSigVersion_split.loc[:,2].str[:2]

av_sig = pd.crosstab(
    [
        final["AvSigVersion_1_x"]
    ], 
    final.HasDetections.astype(bool)
)

av_sig.plot(kind='barh' , figsize = (10,30))
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bc30dcb208>

A more detailed breakdown of AvSigVersion_1_x of 1.273 & 2.75 shows that users are not upgrading their builds thus increasing their risk of an attack.

In [42]:
filted = final.AvSigVersion_1_x.isin(['1.275','1.273'])
# filtered = final.AvSigVersion_1_x == "" | final.AvSigVersion_1_x == ""
av_sig = pd.crosstab(
    [
        final.loc[filted,"AvSigVersion_1_x_xx"]
    ], 
    final.HasDetections.astype(bool)
)


av_sig.plot(kind='barh',figsize = (10,30), )
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bc4b7da1d0>

AVProductsInstalled

The below plot for shows that AVProductsInstalled cateogory value 1 and 2 have different success rate for attacks on the machine. We do believe this data is ordinal and related to the level of support the AV has or another competetive product. However, Microsoft did not release this information.

In [43]:
av_enabled = pd.crosstab(
    [
        final["AVProductsInstalled"]
    ], 
    final.HasDetections.astype(bool)
)

av_enabled.plot(kind='bar')# , figsize = (10,30))
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bc45e41390>

Smart Screen

Smart screen is a feature offered on Defender that helps block malware through your web browser. The data provided had 6 categories, however the Defender application only shows "Block", "Warn", or "Off". We plotted the data again showing the percentage of machines enabling Smart Screen who had dections on their machine for malware.

The chance of a succesful comromise with Malware in Enable mode decreases by about 50% as compared to without SmartScreen enabled.

In [44]:
snart_screen = pd.crosstab(
    [
        final["SmartScreen"]
    ], 
    final.HasDetections.astype(bool)
)

snart_screen.plot(kind='bar')# , figsize = (10,30))
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bc48d1d048>
In [45]:
final_save=final.copy()
df_ss=final_save[['SmartScreen','HasDetections']].copy()

df_ss.SmartScreen.replace({"Block":"Enable","ExistsNotSet":"Disable","Off" :"Disable",
                                "Prompt" :"Enable","RequireAdmin":"Enable","Warn":"Enable"},inplace=True)

df_ss.SmartScreen.value_counts()
df_grouped_ss=df_ss.groupby(by=['SmartScreen'])
detection_rate_ss=df_grouped_ss.HasDetections.sum()/df_grouped_ss.HasDetections.count()

ss = pd.crosstab(
    [df_ss["SmartScreen"].astype(object)], 
    df_ss.HasDetections.astype(bool))

ss_rate=ss.div(ss.sum(1).astype(float),axis=0)
ss_rate.plot(kind='barh',stacked=True)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x1bc30a70438>

Explore Joint Attributes

The original submission for Explore Joint Attributes are replaced with the following updates.

Hardware Configurations

Hardware configuration is investigated for joint attributes exploration. Correlation among various hardware configurations are analyzed.

Interesting relationship is also found between machine's FormFactor to ProcessorCoreCount, PrimaryDiskTotalCapacity, SystemVolumeTotalCapacity, TotalPhysicalRAM, InternalPrimaryDiagonalDisplaySizeInInches, and TotalResolution. By focusing of the median values of the mentioned Hardware configuration, classification can be made for machine Form Factor.

In [46]:
#from pandas.plotting import scatter_matrix
# scatter_matrix(final[cols_numerical], alpha=0.2, figsize=(30,30), diagonal='kde')


import seaborn as sns; sns.set(style="ticks", color_codes=True)

# df = (final[cols_numerical]-final[cols_numerical].mean())/(final[cols_numerical].std())

# display(scl_numericals.describe())

df = final[np.append("Census_MDC2FormFactor",cols_numerical)].copy()
# df["Census_MDC2FormFactor"] = final["Census_MDC2FormFactor"]

# # Apply Log Transformation
df["Census_SystemVolumeTotalCapacity"] = np.log(df.Census_SystemVolumeTotalCapacity)
df["Census_TotalPhysicalRAM"] = np.log(df.Census_TotalPhysicalRAM)
df["Census_PrimaryDiskTotalCapacity"] = np.log(df.Census_PrimaryDiskTotalCapacity)

# Simplify columns names for easier visual analysis 
df.columns = [col.split("_",1)[1] for col in df.columns]

df.rename(index=str, columns={"InternalPrimaryDisplayResolutionHorizontal": "H_Resolution", 
                              "InternalPrimaryDisplayResolutionVertical": "V_Resolution",
                              "InternalPrimaryDiagonalDisplaySizeInInches": "DisplaySize"}, inplace=True)

Below scatter plot shows the relationship among all numerical features which are basically consists of Machine configuration: Processor Count, Primary Disk Total Capacity, System Volume Total Capacity, Total Physical RAM, InternalPrimaryDiagonalDisplaySizeInInches, InternalPrimaryDisplayResolutionHorizontal, and InternalPrimaryDisplayResolutionVertical.

The resolution features InternalPrimaryDisplayResolutionHorizontal and InternalPrimaryDisplayResolutionVertical is highly correlated. In order to adjust for its correlation, we will create a new variable TotalResolution by multiplying the 2 values.

There is also correlation between SystemVolumeTotalCapacity and PrimaryDiskTotalCapacity. There also appears to be an outlier where Primary Volume is Less then System Volume. Since both Disk Capacity values are correlated, we will select PrimaryDiskTotalCapacity during our modeling.

As expected as well as a relationship between Census_ProcessorCoreCount and Census_TotalPhysicalRAM. The majority of the data is exteremly right skewed therefore the data needed to be log transformed. Addtionaly, the visual showing of "zebra" stripes indicate this data may be better suited as a categorical variable.

In [47]:
sns.set(font_scale=1.05)
sns.pairplot(df)
plt.show()
In [48]:
#Creating new Feature "TotalDisplayResolution"
df["TotalDisplayResolution"] = df.H_Resolution * df.V_Resolution
del df["H_Resolution"]
del df["V_Resolution"]
del df["SystemVolumeTotalCapacity"]

Form Factor Classification

By focusing of the median values of the mentioned Hardware configuration, classification can be made for machine Form Factor.

In [50]:
df_grouped = df.groupby("MDC2FormFactor").agg({
    "ProcessorCoreCount":"median",
    "PrimaryDiskTotalCapacity":"median",
    "TotalPhysicalRAM":"median",
    "DisplaySize":"median",
    "TotalDisplayResolution":"median"
})
df_grouped
Out[50]:
ProcessorCoreCount PrimaryDiskTotalCapacity TotalPhysicalRAM DisplaySize TotalDisplayResolution
MDC2FormFactor
AllInOne 2.0 13.768282 8.317766 19.5 1440000.0
Convertible 4.0 13.075146 8.317766 13.2 1049088.0
Desktop 4.0 13.075146 8.317766 18.9 1296000.0
Detachable 4.0 10.303069 7.624619 10.3 1024000.0
LargeServer 7.0 12.242262 9.703999 16.3 968268.0
LargeTablet 4.0 10.302935 7.624619 11.6 1049088.0
MediumServer 16.0 13.054700 10.109526 17.2 1049088.0
Notebook 2.0 13.075146 8.317766 14.0 1049088.0
PCOther 4.0 11.648094 7.624619 14.0 1049088.0
SmallServer 4.0 13.768248 9.010913 17.1 1047552.0
SmallTablet 4.0 10.302935 6.931472 8.0 1024000.0

Heatmap of Hardware Configuration

Below is a heatmap of the Hardware configuration. We are able to confirm our relationship with Processor Cores and Physical memory with correlation of >0.6 We will chose to use both values as they explain different things on a machine. Other than that, there is no additional features that highly correlated.

In [51]:
# Generate a mask for the upper triangle
sns.set(font_scale=1)
mask = np.zeros_like(df.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
ax = sns.heatmap(df.corr(), mask=mask, cmap=cmap, vmax=1, center=0,
            square=False, linewidths=.5, cbar_kws={"shrink": .5},annot=True)

Software Configuration

There are several fields of AVInstalled and AVEnabled that are cateogirized as 0,1,2, etc. but we are not sure what this data means. In order to understand these fields we created a heatmap between different features (excluding builds) to determine if there could be a relationship. We separate the heatmap for positive correlation and negative correlation for easy viewing.

The heatmap below shows possitive correlation of ~0.6 for FlightRing_NOT_SET to ActivationChannel_Volume and DownloadNotify to ActivationChannel_Retail.

In [38]:
# Create Dummy Variables
RtpStateBitfield = pd.get_dummies(final["RtpStateBitfield"],prefix="RtpStateBitfield")
AVProductsInstalled = pd.get_dummies(final["AVProductsInstalled"],prefix="AVProductsInstalled")
AVProductsEnabled = pd.get_dummies(final["AVProductsEnabled"],prefix="AVProductsEnabled")
SmartScreen = pd.get_dummies(final["SmartScreen"],prefix="SmartScreen")
Census_FlightRing = pd.get_dummies(final["Census_FlightRing"],prefix="Census_FlightRing")
Census_OSWUAutoUpdateOptionsName = pd.get_dummies(final["Census_OSWUAutoUpdateOptionsName"],prefix="Census_OSWUAutoUpdateOptionsName")
Census_GenuineStateName = pd.get_dummies(final["Census_GenuineStateName"],prefix="Census_GenuineStateName")
Census_ActivationChannel = pd.get_dummies(final["Census_ActivationChannel"],prefix="Census_ActivationChannel")

df = pd.concat(
    (
        final[["Firewall","IsProtected","HasDetections"]],
        RtpStateBitfield,
        AVProductsInstalled,
        AVProductsEnabled,
        SmartScreen,
        Census_FlightRing,
        Census_OSWUAutoUpdateOptionsName,
        Census_GenuineStateName,
        Census_ActivationChannel
    ), axis=1
)

sns.set(font_scale=3)
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(df.corr(), ax=ax, vmin=0.4)
plt.show()

The heatmap below shows negative correlation of >0.6 for features in the same category. This is expected and we can use them for one hot encoding for futures below:

  • RtpStateBitfield_0.0 and _7.0
  • AVProductsEnabled_0.0 and _1.0
  • AVProductsEnabledand 1.0 and _2.0
  • SmartScreen_ExistNotSet and _RequireAdmin
  • FlightRing_NOT_SET and _Retail
  • FlightRing_Retail and _Unknown
  • GenuineStateName_INVALID_LICENSE and _Genuine
  • Activation_CHANNEL and _Retail
In [42]:
sns.set(font_scale=3)
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(df.corr(), ax=ax, vmax=-0.4)
plt.show()

AV Builds

The below heatmap shows us that EngineVersion is positively correlated with AVSigVersion. Through our reserch, we learned that AVSigVersion has the latest versions of malware definitions. During our building of the model, we will favor AVSigVersion instead of EngineVersion.

In [47]:
AppVersion_split = final["AppVersion"].str.rsplit(pat=".",expand=True)
final["AppVersion_x_x"] = AppVersion_split.loc[:,0]+"."+AppVersion_split.loc[:,1]#+"."+AppVersion_split.loc[:,2]

AvSigVersion_split = final["AvSigVersion"].str.rsplit(pat=".",expand=True)
final["AvSigVersion_x_x"] = AvSigVersion_split.loc[:,0]+"."+AvSigVersion_split.loc[:,1]

EngineVersion = pd.get_dummies(final["EngineVersion"],prefix="EngineVersion")
AppVersion_x_x = pd.get_dummies(final["AppVersion_x_x"],prefix="AppVersion_x_x")
AvSigVersion_x_x = pd.get_dummies(final["AvSigVersion_x_x"],prefix="AvSigVersion_x_x")


df = pd.concat(
    (
        EngineVersion,
        AvSigVersion_x_x,
        AppVersion_x_x
    ), axis=1
)

sns.set(font_scale=3)
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(df.corr(), ax=ax, vmin=0.4)
plt.show()

Negative heatmap below shows negative correlation between AvSigVersion 1.273 and 1.277. This shows that a system that is upgraded to AvSigVersion1.277 is normally has the Revision of 1.273. Also, a similar negative correlation is seen for EngineVersion 1.1.15200.1 and 1.1.15300.6. From the previous plot, we know that is expected given the high positive correlation between AvSigVersion to EngineVersion.

In [48]:
sns.set(font_scale=3)
fig, ax = plt.subplots(figsize=(40,40))
sns.heatmap(df.corr(), ax=ax, vmax=-0.4)
plt.show()

Explore Attributes and Class

Diskspace & Memory compared to Devices

Further details from the boxplots, the density of Detections for Diskspace and Memory was analyzed.. In the violin plot for Census_TotalPhysicalRAM . We can visually see that Median server, as the memory increases there is a higher likely hood of vulnerability. The remaining servers, with the exception of Large (low detections), memory is not a high factor between detections. The feature Census_PrimaryDiskTotalCapacity also shows a similar pattern, where if you are a server with more diskspace you are more likely to be attacked by malware.

In [235]:
cmap = sns.diverging_palette(220, 10, as_cmap=True)
f, ax = plt.subplots(figsize=(20,30))

final["Census_TotalPhysicalRAM_gb"] = final.Census_TotalPhysicalRAM / 1000

sns.violinplot(x="Census_MDC2FormFactor", y="Census_TotalPhysicalRAM_gb", hue="HasDetections", data=final, 
               split=True, inner="quart")
Out[235]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ad9c371d0>
In [236]:
cmap = sns.diverging_palette(220, 10, as_cmap=True)
f, ax = plt.subplots(figsize=(20,30))
sns.violinplot(x="Census_MDC2FormFactor", y="Census_PrimaryDiskTotalCapacity", hue="HasDetections", data=final, 
               split=True, inner="quart")
Out[236]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a6435f080>

Defender Version Details

While researching different builds of Defender, we learned there is the version of the actual Application, AppVersion, along with definition version to detect Malware, AvSigVersion, and Engine Version EngineVersion. In order to understand its relationship between detections, we compared combinations of different versions. Visually the 2 largest groups of version combinations has the make chance for an attack. This may be an indication that attackers know what the adoption rate for versions and targeting these versions.

In [237]:
# get only the 4.X values
AppVersion_split = final["AppVersion"].str.rsplit(pat=".",expand=True)
final["AppVersion_reduce"] = AppVersion_split.loc[:,0]+"."+AppVersion_split.loc[:,1]#+"."+AppVersion_split.loc[:,2]

defenders = pd.crosstab(
    [
        final["EngineVersion"].fillna(1),
        final["AppVersion_reduce"].fillna(1),
        final["AvSigVersion_1_x"]
    ], 
    final.HasDetections.astype(bool)
)

defenders.plot(kind='barh',stacked=False, figsize = (10,70))
Out[237]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2bb3e4e0>

Antivirus and Firewall

Through domain knowledge we know firewall is important to prevent attacks. In order to visualize if a Firewall is an important feature, we grouped the variable Firewall with AVProductsEnabled to determine its effectiveness. The rate for each group is very similar. This is because user behavior indicates more attacks are done by downloads versus straight attacks on the operating system.

In [253]:
final['AVProductsEnabled'] = final['AVProductsEnabled'].astype(int)
final['av_status']=pd.cut(final.AVProductsEnabled,[-1,0,final.AVProductsEnabled.max()],2,labels=['NoAV','AV'])
final['fw_status']=pd.cut(final.Firewall,[-1,0,final.Firewall.max()],2,labels=['NoFW','FW'])

final['Def_status'] = final[['av_status', 'fw_status']].apply(lambda x: '_'.join(x), axis=1)

df_avfw=final[['Def_status','HasDetections']].copy()

df_grouped_avfw=df_avfw.groupby(by=['Def_status'])
detection_rate=df_grouped_avfw.HasDetections.sum()/df_grouped_avfw.HasDetections.count()
detection_rate.plot(kind='bar')
Out[253]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ab7377f60>

New Features

One of the key data points that is absent in this dataset is the actual risk that an attacker would compromise the machine based on known vulnerabilities because simply taking a data set without understanding the vulnerabilities related to exposure, does not tell the entire story. Without this data, the machine learning models will miss critical attributes of information in determining if malware could compromise the host. While the Microsoft competition purpose is to determine if machines can be compromised, it is vital to understand the actual risk of the machine to determine if an attacker can actually deploy malware on a machine. From an attacker standpoint, the methodology that is used to compromise a machine is to research whether a machine has vulnerabilities with tools like nMap or Nessus so they can launch the attack. Attackers cannot deploy malware without exploiting vulnerabilities and only understanding the operating system levels, anti-virus dat versions, operating system configuration has no real context from an attacker standpoint to determine whether a machine can be compromised.

The first data element that is missing is the Common Vulnerabilities and Exposures (CVE) data to determine if the machine is vulnerable. In order to gather this data, the OS build field will need to be compared to the Microsoft version number. For example the Build number of 17763.253 matches to Version 1809 (e.g.Microsoft Build to Version Conversion). The second step is to find all the CVE's related with that version of Operating System. For example, version 1809 has 73+ related CVEs. (e.g.Microsoft Version to Listing of CVE Vulnerabilites). Once this data is obtained, the data can be aggregated by severity based on the CVE Score with the highest exploitable vulnerability as the rating.

In [6]:
# Include new features
ms_versions = pd.read_csv("data/vulnerabilities.csv")
final["os_build"] = final.Census_OSVersion.str.slice(5,10).astype(int)

final = pd.merge(final, ms_versions, on='os_build', how='left')

ms_versions[cols_cve].drop_duplicates()
Out[6]:
version released_date total_vulnerabilities dos code_execution overflow memory_corruption xss bypass_something gain_Information gain_privileges
0 1903 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1809 11/13/18 74.0 2.0 23.0 15.0 0.0 1.0 6.0 19.0 0.0
25 1803 4/30/18 163.0 8.0 43.0 23.0 2.0 1.0 19.0 36.0 0.0
29 1709 10/17/17 265.0 18.0 52.0 26.0 1.0 1.0 37.0 78.0 0.0
33 1703 4/11/17 447.0 34.0 90.0 35.0 3.0 0.0 45.0 159.0 3.0
36 1607 8/2/16 568.0 45.0 118.0 53.0 6.0 2.0 58.0 181.0 58.0
40 1511 8/15/15 458.0 38.0 91.0 36.0 9.0 1.0 39.0 155.0 94.0
41 1507 7/29/15 514.0 33.0 121.0 54.0 8.0 1.0 44.0 144.0 84.0

The second data point that is missing is related to the Antivirus program. The "EngineVersion", "AppVersion", and "AVSigVersion" all need to be transformed related to how far they are out of date compared to the current date. While attackers use vulnerabilities to deploy malware, is a system is out of date, the machine will have a higher degree of susceptibility.

The last data element that is missing is related to the firewall. While under most circumstances’ firewalls are a good way at protecting a machine, the exploitability factors are not related to whether the firewall, inbound or outbound, is on or off, but the determination of what ports or services are open on the firewall external to the machine. However, this data cannot be captured through an external data source because this is something local to the machine that would have had to be captured as part of the original dataset.

Exceptional Work

SUCCESS MEASURE You have free reign to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.

For the exceptional work bonus points, we believe that the subject matter expertise leveraged in the business understand in this report. We performed additional research to actually understand the problem that Microsoft is facing in this Kaggle competition and presented the challenges of the cybersecurity maket and attack landscape in context to the entire industry and the problem that Microsoft is trying to solve.

The second piece of exceptional work in this report is related to the additional features that need to be captured to understand whether or not a machine is vulnerability from a security perspective. Three key features are missing which include total vulnerabilities, Antivirus program aging, and open firewall ports and services. All of these data elements would tremendously help aid in the predication of whether the machine could be compromised with malware. As part of our EDA, the team did bring in the data set that compared the OS and Patch version to vulnerabilities. While this data is important for this stage in the project, we fully expect that this will help with our prediction in the future classes

The thrid piece of exception work was data mining and munging the existing dataset. Since the data set contain over 300,000 rows and 82 variables, we spent a lot of time cleaning the data. We were able to delete columumns that had over 95% invalid data and munged the data in the columns that were missing data and inserted either median or mode data depending on whether is was numberical or catagorical.

In [ ]: